In [1]:
%%html
<style>
    .container { width:800px !important; }
</style>
In [2]:
import pandas as pd
import importlib
import plotly.graph_objects as go
import plotly.express as px
import plotly
from scipy.stats import shapiro
from scipy.stats import levene
from scipy.stats import mannwhitneyu
import warnings
from IPython.display import display, Markdown, HTML

# Graphing, helper stuff
import src.graph as graph
import src.stats_utils as stats_utils
import src.pg_query as pg_query
import src.queries as queries
import src.graph_utils as graph_utils
In [3]:
importlib.reload(pg_query)
importlib.reload(queries)
importlib.reload(graph_utils);
In [4]:
# Config Settings
plotly.offline.init_notebook_mode()
VERBOSE = False

if not VERBOSE:
    warnings.filterwarnings("ignore")

Background¶

Spotify invested over $1 billion into podcasts between 2018 and 2022. The strategy focused on acquiring and producing exclusive podcast content to boost its market presence.

  • "The Joe Rogan Experience": ~$100 million deal, with 11 million listeners per episode.
  • "Archetypes" by the Sussexes: $20 million contract, discontinued after a single season.
  • Partnership with the Obamas: estimated $30million contract

Despite these significant expenditures, financial returns have lagged, with Spotify reporting a net loss of €430 million in 2022 and ongoing deficits. As a result, the company has reevaluated its approach, streamlining operations and canceling underperforming shows.

Spotify now focuses on sustainable growth through content diversification and bolstering podcast advertising. The success of these strategic adjustments in the competitive and fluctuating digital ad space remains to be seen.

Introduction¶

The purpose of this analysis is to examine whether Spotify made the correct choice of focusing on a small number of very expensive podcast deals or whether some other approaches might have made more sense longterm.

Specifically we'll look into:

  1. How successful was Spotify in growing it's podcast listener base?
  2. Did major podcasts outpace others in attracting listeners, justifying Spotify's past investment focus?

We'll be using Podcast Reviews dataset from Kaggle.

Popularity and average top 10% percentile count over time¶

In [5]:
reviews_col_names = queries.get_col_names(pg_query.engine)
if VERBOSE:
    display(reviews_col_names)
In [6]:
# TODO: ADD SUMMARY DATA LIKE
# FOR EACH TABLE
In [7]:
reviews_by_month_count_df = queries.get_reviews_by_month_count_df(pg_query.engine)
reviews_by_month_count_df_after_2015 = reviews_by_month_count_df[
    (reviews_by_month_count_df["year_month"] > "2015-01-01")
    & (
        reviews_by_month_count_df["year_month"]
        < reviews_by_month_count_df["year_month"].max()
    )
]
In [8]:
if VERBOSE:
    display(reviews_by_month_count_df.dtypes)
In [9]:
importlib.reload(graph)

fig = graph.render_fig_reviews_by_month_count_after_2015(
    reviews_by_month_count_df_after_2015
)
graph_utils.render_fig(fig)

We can see that the share of reviews for the top 5% and especially top 1% of podcasts started increasing significantly after 2018. This implies that their strategy was successful both in attracting new listeners and those listeners disproportionally listened to small number of most popular podcasts compare to the years before.

We see a significant falloff after 2021, this probably has several explanations:

  • media consumption in general decreased in the aftermath of the Covid pandemic.
  • we're using review count as a proxy for popularity, which is problematic because users can only leavy a single review for a podcast so we can't track whether they continued listening to those podcasts.

Some questions we need to consider, if Spotify's investment in podcasts (both specific and overall infrastructure) resulted in significant user growth, did most of these users:

  1. Disproportionately listen to these most expensive/most popular podcasts?
  2. If so, did these new users later engage with other podcasts as well?
  3. Were the new users retained over a longer period, or was there a significant drop-off?
Did most of the growth went to the top 1% of podcasts¶
Hypothesis I¶

Did Spotify's investment and overall strategy of focusing on a small number of creators prove effective? Specifically, did the growth rate in popularity of the most popular podcasts (defined as the top 1st percentile based on the number of reviews) exceed that of other podcasts? Based on this question, we formulate our first hypothesis:

H1: The number of reviews for the most popular podcasts is increasing at a faster rate than for the bottom 99% of all podcasts.

To test this hypothesis, follow these steps:

  1. Transform the reviews_by_month_count_df_after_2015 dataframe to show the monthly growth rate for the top 1% and bottom 99% of podcasts.
In [10]:
importlib.reload(stats_utils)

df = reviews_by_month_count_df.copy()[["year_month", "prop_top_1", "total_reviews"]]
df = df.sort_values("year_month")

df["year_month"] = pd.to_datetime(df["year_month"])
df["prop_bottom_99"] = 1 - df["prop_top_1"]
df["growth_rate_top_1"] = (df["total_reviews"] * df["prop_top_1"]).pct_change()
df["growth_rate_bottom_99"] = (
    df["total_reviews"] * (1 - df["prop_top_1"])
).pct_change()

df_after2020 = reviews_by_month_count_df[
    (reviews_by_month_count_df["year_month"] > "2020-01-01")
    & (
        reviews_by_month_count_df["year_month"]
        < reviews_by_month_count_df["year_month"].max()
    )
]

mean_top_1 = df["growth_rate_top_1"].mean()
std_top_1 = df["growth_rate_top_1"].std()

mean_bottom_99 = df["growth_rate_bottom_99"].mean()
std_bottom_99 = df["growth_rate_bottom_99"].std()

# Normality
shapiro_test_stat, shapiro_p_val = shapiro(df["growth_rate_top_1"].dropna())

# Homogeneity of Variances
levene_test_stat, levene_p_val = levene(
    df["growth_rate_top_1"].dropna(), df["growth_rate_bottom_99"].dropna()
)
u_stat, p_val_mannwhitney = mannwhitneyu(
    df["growth_rate_top_1"].dropna(), df["growth_rate_bottom_99"].dropna()
)

avg_top_1 = df["growth_rate_top_1"].mean()
avg_bottom_99 = df["growth_rate_bottom_99"].mean()
In [11]:
display(
    Markdown(
        f"""
**Growth rate for top 1%:**  
Mean: {round(mean_top_1, 2)} (Std Dev: {round(std_top_1, 2)})

**Growth rate for bottom 99%:**  
Mean: {round(mean_bottom_99, 2)} (Std Dev: {round(std_bottom_99, 2)})

To decide the appropriate test, the following should be considered:  
1. Data should be normally distributed or the sample size should be large.
2. Variances of the two groups being compared should be equal.

If these assumptions do not hold, a non-parametric test like the **Mann-Whitney U Test** should be used.

**Shapiro-Wilk Test for Normality:**  
Test Stat: {round(shapiro_test_stat, 2)}  
P-value: {round(shapiro_p_val, 2)} (If p-value < 0.05, data is not normally distributed).  
This indicates that a non-parametric test should be used.

**Levene Test for Homogeneity of Variances:**  
Test Stat: {round(levene_test_stat, 2)}  
P-value: {round(levene_p_val, 2)}  
This also supports the decision to use a non-parametric test.

**Mann-Whitney U Test:**  
U-value: {round(levene_test_stat, 2)}  
P-value: {round(levene_p_val, 2)}

The p-value indicates a significant difference in growth rates between the top 1% and bottom 99%. This mean we can reject the null hypothesis, which would support our initial hypothesis that the popularity of the top 1% most popular podcast has grown at a significantly faster pace than the rest.

**Comparison of Average Growth Rates:**  
Average growth for top 1%: {round(avg_top_1, 2)}  
Average growth for the bottom 99%: {round(avg_bottom_99, 2)}
"""
    )
)

Growth rate for top 1%:
Mean: 0.08 (Std Dev: 0.44)

Growth rate for bottom 99%:
Mean: 0.02 (Std Dev: 0.14)

To decide the appropriate test, the following should be considered:

  1. Data should be normally distributed or the sample size should be large.
  2. Variances of the two groups being compared should be equal.

If these assumptions do not hold, a non-parametric test like the Mann-Whitney U Test should be used.

Shapiro-Wilk Test for Normality:
Test Stat: 0.76
P-value: 0.0 (If p-value < 0.05, data is not normally distributed).
This indicates that a non-parametric test should be used.

Levene Test for Homogeneity of Variances:
Test Stat: 40.53
P-value: 0.0
This also supports the decision to use a non-parametric test.

Mann-Whitney U Test:
U-value: 40.53
P-value: 0.0

The p-value indicates a significant difference in growth rates between the top 1% and bottom 99%. This mean we can reject the null hypothesis, which would support our initial hypothesis that the popularity of the top 1% most popular podcast has grown at a significantly faster pace than the rest.

Comparison of Average Growth Rates:
Average growth for top 1%: 0.08
Average growth for the bottom 99%: 0.02

Growth rate for listeners of top 1% of podcasts has been significantly higher than for the remainder of podcasts. This in combination with the rapidly accelerating growth after 2018 would valid Spotify's approach. However we must remain cautious:

  • The podcast sector itself has been rapidly growing across all platforms especially during the Covid pandemic.
  • It's not clear if this still made financial sense, even if Spotify has been successful in increasing its listeners base their cost per gained user turned out to be unsustainable high financially.

Distribution of Podcasts by Popularity:¶

We'll further look into the distribution of podcasts by popularity to get a clearer picture of just how unequal the distribution of listeners is.

We'll use the Lorenz curve (which is commonly used in economic to measure wealth/income inequality and is tied to the GINI index) to visualize the disparity in podcast reviews, as it effectively highlights the degree of inequality.

In [12]:
importlib.reload(pg_query)
df_podcasts_by_rev_count = queries.get_podcasts_by_review_count(pg_query.engine);
In [13]:
df_podcasts_by_rev_count.columns
Out[13]:
Index(['podcast_id', 'title', 'num_reviews', 'categories'], dtype='object')
In [14]:
importlib.reload(graph)

df_podcasts_by_rev_count = df_podcasts_by_rev_count.sort_values(
    by="num_reviews"
)  # Sort
df_podcasts_by_rev_count["cumulative_share"] = (
    df_podcasts_by_rev_count["num_reviews"].cumsum()
    / df_podcasts_by_rev_count["num_reviews"].sum()
)

fig, gini = graph.render_fig_lorenz_curve(df_podcasts_by_rev_count)
graph_utils.render_fig(fig)

print(f"Gini coefficient is: {gini}")
Gini coefficient is: 0.93
In [15]:
total_reviews = df_podcasts_by_rev_count["num_reviews"].sum()

percentiles = [0.01, 0.05, 0.10, 0.50]
thresholds = df_podcasts_by_rev_count["num_reviews"].quantile(
    [1 - p for p in percentiles]
)

proportions = {}
for percentile, threshold in zip(percentiles, thresholds):
    proportion = df_podcasts_by_rev_count.loc[
        df_podcasts_by_rev_count["num_reviews"] >= threshold, "num_reviews"
    ].sum()
    proportions[f"Top {percentile*100:.0f}%"] = proportion

total_reviews = df_podcasts_by_rev_count["num_reviews"].sum()
proportions = {k: round((v / total_reviews) * 100, 1) for k, v in proportions.items()}
proportions_df = pd.DataFrame(
    list(proportions.items()), columns=["Percentile", "Proportion (%)"]
)

display(
    Markdown(
        "Distribution of podcast reviews by percentile (e.g. top 5% of all podcasts have 71.2% of all reviews):"
    )
)
print(proportions_df)

Distribution of podcast reviews by percentile (e.g. top 5% of all podcasts have 71.2% of all reviews):

  Percentile  Proportion (%)
0     Top 1%            43.1
1     Top 5%            71.2
2    Top 10%            81.5
3    Top 50%            98.1
User Engagement Analysis¶

In this part we'll look into user listening patterns and the effect of Spotify's recent investments on them. Specifically we want to look into whether the new users attracted to the platform were more likely to listen to 1% of most popular podcasts compared to users who have joined the platform at an earlier point.

In [16]:
importlib.reload(queries)
user_reviews_data_df = queries.get_user_reviews_data(pg_query.engine)
if VERBOSE:
    display(user_reviews_data_df);
In [17]:
display(
    Markdown(
        f"""Distribution of review count by user:
        
mean: {user_reviews_data_df["review_count"].mean():.2f}
median: {user_reviews_data_df["review_count"].median():.2f}
max: {user_reviews_data_df["review_count"].max():.0f}

stdev: {user_reviews_data_df["review_count"].std():.2f}
skewness*: {user_reviews_data_df['review_count'].skew():.2f}
*is extremely high and indicates a very strong rightward skewness. This suggests that most of the data values are clustered around the left, with a few extremely large values on the right.*

kurtosis**: {user_reviews_data_df['review_count'].kurtosis():.2f}

*direction and degree of asymmetry. A positive skew indicates that the tail is on the right side of the distribution.
** high kurtosis means more of the variance is the result of infrequent extreme deviations.
"""
    )
)

Distribution of review count by user: mean: 1.34 median: 1.00 max: 614

stdev: 1.83 skewness: 98.11 *is extremely high and indicates a very strong rightward skewness. This suggests that most of the data values are clustered around the left, with a few extremely large values on the right.

kurtosis**: 21137.37

*direction and degree of asymmetry. A positive skew indicates that the tail is on the right side of the distribution. ** high kurtosis means more of the variance is the result of infrequent extreme deviations.

We can see that the majority of users have left 1 review or less. While a small proportion of users have left a very large number of reviews. There are some users who have written hundreds of reviews which seems somewhat suspicious but since their number is very small it's not inconceivable that some users might have listened to hundreds of different podcasts over 4+ years.

In [18]:
bins = [0, 2, 3, 6, 9999999]
labels = ["1", "1-2", "2-5", "5+"]

user_reviews_data_df["bin"] = pd.cut(
    user_reviews_data_df["review_count"], bins=bins, labels=labels, right=False
)
user_reviews_data_df["other_reviews"] = (
    user_reviews_data_df["review_count"]
    - user_reviews_data_df["top_percentile_review_count"]
)

bin_counts = user_reviews_data_df["bin"].value_counts().sort_index()

total_users = len(user_reviews_data_df)
proportions = (bin_counts / total_users) * 100
proportions_text = [f"{p:.2f}%" for p in proportions]

top_review_bin_counts = user_reviews_data_df.groupby("bin")[
    "top_percentile_review_count"
].sum()
other_review_bin_counts = user_reviews_data_df.groupby("bin")["other_reviews"].sum()

total_bin_counts = user_reviews_data_df.groupby("bin")["review_count"].sum()

top_proportions = (top_review_bin_counts / total_bin_counts) * 100
other_proportions = (other_review_bin_counts / total_bin_counts) * 100

top_proportions_text = [f"{p:.2f}%" for p in top_proportions]
other_proportions_text = [f"{p:.2f}%" for p in other_proportions]

fig = go.Figure(
    data=[
        go.Bar(
            name="Other Reviews",
            x=labels,
            y=other_review_bin_counts.values,
            text=other_proportions_text,
            textposition="auto",
            textfont=dict(size=12),
        ),
        go.Bar(
            name="Top Percentile Reviews",
            x=labels,
            y=top_review_bin_counts.values,
            text=top_proportions_text,
            textposition="auto",
            textfont=dict(size=12),
        ),
    ]
)

fig.update_layout(
    barmode="stack",
    yaxis_title="Number of Users (Log Scale)",
    xaxis_title="Review Count Range",
    title="Distribution of User Reviews",
)

graph_utils.render_fig(fig)
In [19]:
if VERBOSE:
    display(user_reviews_data_df["first_review_is_top"].value_counts())
    display(user_reviews_data_df.columns)
In [20]:
# Calculate unique users per quarter
user_reviews_data_df["first_review_month"] = pd.to_datetime(
    user_reviews_data_df["first_review_month"]
)

user_reviews_data_df_after_2015 = user_reviews_data_df[
    (user_reviews_data_df["first_review_month"] > "2015-01-01")
    & (
        user_reviews_data_df["first_review_month"]
        < user_reviews_data_df["first_review_month"].max()
    )
]

unique_users_by_quarter = (
    user_reviews_data_df_after_2015.groupby(
        [
            user_reviews_data_df_after_2015["first_review_month"].dt.to_period("Q"),
            "first_review_is_top",
        ]
    )["author_id"]
    .nunique()
    .reset_index()
)
unique_users_by_quarter.columns = ["Quarter", "first_review_is_top", "Unique_Users"]
unique_users_by_quarter["Quarter"] = unique_users_by_quarter["Quarter"].astype(str)


fig = px.bar(
    unique_users_by_quarter,
    x="Quarter",
    y="Unique_Users",
    color="first_review_is_top",
    title="New Monthly Users by First Reviewed Podcast",
)

fig.update_xaxes(tickangle=-45)
fig.update_layout(legend_title_text="First reviewed podcast in top 1%")
fig.update_layout(
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)

graph_utils.render_fig(fig)

The chart above tracks the number of new users (based on first left review) in a given quarter and show whether the first review a user left was for a top 1% podcast based on popularity.

We can see that the proportion of new users who have first listened to one of the most popular podcasts has been increasing inline with our previous findings.

Hypothesis II¶

H: There is a difference in the number of reviews left during the first 6 months on the platform between users who joined after 2020-01-01 and those who joined before

If we assume based on our prior analysis that a significantly higher proportion of users who have joined the platform after 2020 were attracted by one of the newly acquired or highly popular podcasts we want to check whether these users have stayed on the platform and listened to other podcasts as much as the users who had joined earlier.

Null Hypothesis (H0): There is no difference in the number of reviews left during the first 6 months on the platform between users who joined after 2020-01-01 and those who joined earlier.

We'll again use the Mann-Whitney U Test to check this:

In [21]:
cutoff_date = pd.to_datetime("2020-01-01")
user_reviews_data_df["first_review_month"] = pd.to_datetime(
    user_reviews_data_df["first_review_month"]
)
user_reviews_data_df["last_review_month"] = pd.to_datetime(
    user_reviews_data_df["last_review_month"]
)
max_date = user_reviews_data_df["last_review_month"].max()
six_months_ago = max_date - pd.DateOffset(months=6)

filtered_df = user_reviews_data_df[
    user_reviews_data_df["first_review_month"] <= six_months_ago
]

old_users = filtered_df[filtered_df["first_review_month"] < cutoff_date]
new_users = filtered_df[filtered_df["first_review_month"] >= cutoff_date]

u_statistic, p_value = mannwhitneyu(
    old_users["first_six_months_review_count"],
    new_users["first_six_months_review_count"],
    alternative="two-sided",
)

print(f"U statistic: {round(u_statistic, 2)}")
print(f"P-value: {round(p_value, 2)}")

if p_value < 0.05:
    # If the p-value is less than 0.05, we reject the null hypothesis.
    print(
        "There is a significant difference in the number of reviews left during the first six months between the two groups."
    )
else:
    print(
        "There is no significant difference in the number of reviews left during the first six months between the two groups."
    )

old_users_mean = old_users["first_six_months_review_count"].mean()
old_users_std = old_users["first_six_months_review_count"].std()

# Calculate the mean and standard deviation for new users
new_users_mean = new_users["first_six_months_review_count"].mean()
new_users_std = new_users["first_six_months_review_count"].std()

# Output the mean and standard deviation for both groups
print(
    f"Old Users - Mean: {old_users_mean:.2f}, Std Dev: {old_users_std:.2f}, Count: {len(old_users)}"
)
print(
    f"After 2020 Users - Mean: {new_users_mean:.2f}, Std Dev: {new_users_std:.2f}, Count: {len(new_users)}"
)
U statistic: 251977982392.5
P-value: 0.0
There is a significant difference in the number of reviews left during the first six months between the two groups.
Old Users - Mean: 1.13, Std Dev: 0.98, Count: 773917
After 2020 Users - Mean: 1.12, Std Dev: 0.99, Count: 648280

The Mann-Whitney U test indicates a significant difference with a P-value of 0.0, it suggests that there is a statistically significant difference in the distributions of the two groups.

However, the means and std. dev are almost identical. This is possibly due to the very large sample sizes and the statistical significance might not be practically meaningful.

In [22]:
old_users["bin"] = pd.cut(
    old_users["first_six_months_review_count"], bins=bins, labels=labels, right=False
)
new_users["bin"] = pd.cut(
    new_users["first_six_months_review_count"], bins=bins, labels=labels, right=False
)

old_user_bin_counts = old_users["bin"].value_counts().sort_index()
new_user_bin_counts = new_users["bin"].value_counts().sort_index()

old_user_proportions = (old_user_bin_counts / old_users["bin"].count()) * 100
new_user_proportions = (new_user_bin_counts / new_users["bin"].count()) * 100

fig = go.Figure(
    data=[
        go.Bar(
            name="Old Users",
            x=labels,
            y=old_user_proportions.values,
            text=[f"{p:.2f}%" for p in old_user_proportions],
            textposition="auto",
        ),
        go.Bar(
            name="New Users",
            x=labels,
            y=new_user_proportions.values,
            text=[f"{p:.2f}%" for p in new_user_proportions],
            textposition="auto",
        ),
    ]
)

fig.update_layout(
    barmode="group",
    yaxis=dict(title="Percentage of Total Users in Group", ticksuffix="%"),
    xaxis_title="Review Count Range in First 6 Months",
    title="Comparison of Review Count Proportions Between Users Who Have Joined Prior to 2020 and After",
)

graph_utils.render_fig(fig)

If we visualize it the difference is actually somewhat more discernible, users joining after 2020 have left more than 1 review slightly less often than older users. This would indicate that your hypothesis might be accurate and that new users are a bit less likely to explore the platform and listen to other podcasts. However the effect size is very small and probably not practically significant.

Podcast Genre/Genre Analysis¶

In this section we'll examine whether there is significant variance between review distribution based on podcast category/genre.

In [23]:
# Each podcast might belong to many categories, so we'll get a table with all podcast/category pairs
importlib.reload(queries)

pod_cat_pairs = queries.get_pod_cat_pairs(pg_query.engine)
if VERBOSE:
    display(pod_cat_pairs)

TOP_N = 20
In [24]:
# Podcasts might contain multiple categories like 'religion-spirituality' and 'religion', we want to look at only
# the top level category. So let's select the top level category only and drop all duplicate category entries for
# each podcast.


def get_top_level_category(pod_cat_pairs):
    pod_cat_pairs["top_level_category"] = (
        pod_cat_pairs["category"].str.split("-").str[0]
    )
    pod_cat_pairs_top_lvl_cat = pod_cat_pairs.drop_duplicates(
        ["podcast_id", "top_level_category"]
    )

    # 'true' is ambiguous keep true crime
    pod_cat_pairs_top_lvl_cat.loc[
        pod_cat_pairs_top_lvl_cat["category"].str.startswith("true-crime"),
        "top_level_category",
    ] = "true-crime"
    return pod_cat_pairs_top_lvl_cat


total_unique_podcasts = pod_cat_pairs["podcast_id"].nunique()

pod_cat_pairs_top_lvl_cat = get_top_level_category(pod_cat_pairs)
pod_cat_pairs_top_lvl_cat = pod_cat_pairs_top_lvl_cat.drop_duplicates(
    ["podcast_id", "top_level_category"]
)
In [24]:
 
In [25]:
def top_categories_with_unique_podcasts(data, col, n):
    category_stats = (
        data.groupby(col)
        .agg(
            review_count=pd.NamedAgg(column="review_count", aggfunc="sum"),
            unique_podcasts=pd.NamedAgg(column="podcast_id", aggfunc="nunique"),
        )
        .reset_index()
    )
    category_stats = category_stats.sort_values(by="review_count", ascending=False)
    top_categories = category_stats.head(n)

    return top_categories


category_counts = top_categories_with_unique_podcasts(
    pod_cat_pairs_top_lvl_cat, "category", TOP_N
)
category_counts_top_cat = top_categories_with_unique_podcasts(
    pod_cat_pairs_top_lvl_cat, "top_level_category", TOP_N
)

if VERBOSE:
    display(category_counts)
    display(category_counts_top_cat)
    display(pod_cat_pairs_top_lvl_cat.columns)
    display(category_counts_top_cat["top_level_category"].unique())
In [26]:
def add_prctile_group(df):
    # Calculate the 1% and 5% thresholds for each top_level_category
    thresholds_1 = df.groupby("top_level_category")["review_count"].quantile(0.99)
    thresholds_5 = df.groupby("top_level_category")["review_count"].quantile(0.95)

    def label_percentile(row):
        threshold_1 = thresholds_1.get(row["top_level_category"], 0)
        threshold_5 = thresholds_5.get(row["top_level_category"], 0)
        if row["review_count"] > threshold_1:
            return "Top 1%"
        elif row["review_count"] > threshold_5:
            return "1-5%"
        else:
            return "Bottom 95%"

    # Apply the labeling function
    df["percentile_group"] = df.apply(label_percentile, axis=1)
    return df
In [27]:
filtered_dataframe = pod_cat_pairs_top_lvl_cat[
    pod_cat_pairs_top_lvl_cat["top_level_category"].isin(
        category_counts_top_cat["top_level_category"]
    )
]

if VERBOSE:
    display(category_counts_top_cat["top_level_category"].unique())
    display(filtered_dataframe["top_level_category"].unique())

add_prctile_group(filtered_dataframe)

sum_df = (
    filtered_dataframe.groupby(["top_level_category", "percentile_group"])[
        "review_count"
    ]
    .sum()
    .reset_index()
)

sorted_categories = (
    sum_df.groupby("top_level_category")["review_count"]
    .sum()
    .sort_values(ascending=False)
    .index.tolist()
)

fig = px.bar(
    sum_df,
    x="top_level_category",
    y="review_count",
    color="percentile_group",
    title="Review Count by Top Level Category and Percentile Group",
    labels={"review_count": "Review Count", "top_level_category": "Top Level Category"},
    barmode="stack",
    category_orders={
        "top_level_category": sorted_categories,
        "percentile_group": ["Bottom 95%", "1-5%", "Top 1%"],
    },
)

total_reviews = sum_df.groupby("top_level_category")["review_count"].sum().to_dict()

for data in fig.data:
    proportions = [
        (count / total_reviews[category]) * 100
        for count, category in zip(data["y"], data["x"])
    ]
    data["text"] = [f"{p:.0f}%" for p in proportions]
    data["textposition"] = "auto"

fig.update_layout(legend_title_text="Distribution of reviews for podcasts in category")
fig.update_layout(
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)

graph_utils.render_fig(fig);
In [28]:
pod_cat_pairs_top_lvl_cat = add_prctile_group(pod_cat_pairs_top_lvl_cat)
podcast_count = (
    pod_cat_pairs_top_lvl_cat.groupby("top_level_category")["podcast_id"]
    .nunique()
    .reset_index(name="num_podcasts")
)

top_1_percent_reviews = pod_cat_pairs_top_lvl_cat[
    pod_cat_pairs_top_lvl_cat["percentile_group"] == "Top 1%"
]
top_1_percent_sum = (
    top_1_percent_reviews.groupby("top_level_category")["review_count"]
    .sum()
    .reset_index(name="top_1_percent_reviews")
)

total_reviews = (
    pod_cat_pairs_top_lvl_cat.groupby("top_level_category")["review_count"]
    .sum()
    .reset_index(name="total_reviews")
)

merged_df = pd.merge(
    podcast_count, top_1_percent_sum, on="top_level_category", how="outer"
)
merged_df = pd.merge(merged_df, total_reviews, on="top_level_category", how="outer")
merged_df = merged_df[merged_df["total_reviews"] >= 20000]
merged_df["proportion_top_1"] = (
    merged_df["top_1_percent_reviews"] / merged_df["total_reviews"]
)

fig = px.scatter(
    merged_df,
    x="proportion_top_1",
    y="total_reviews",
    text="top_level_category",
    size="num_podcasts",
    labels={
        "proportion_top_1": "Proportion of Reviews in Top 1%",
        "total_reviews": "Total Review Count",
    },
    title="Proportion of Top 1% Reviews vs Total Review Count",
)

fig.update_traces(textposition="top center")

graph_utils.render_fig(fig)
In [29]:
if VERBOSE:
    pod_cat_pairs_top_lvl_cat
    pod_cat_pairs_top_lvl_cat["percentile_group"].value_counts()
In [30]:
top_categories = category_counts_top_cat["top_level_category"]
top_podcasts_df = pod_cat_pairs_top_lvl_cat[
    pod_cat_pairs_top_lvl_cat["top_level_category"].isin(top_categories)
]

top_podcasts_df["top_1_prct"] = top_podcasts_df["percentile_group"].map(
    lambda l: l == "Top 1%"
)

review_changes = (
    top_podcasts_df.groupby(["top_level_category", "top_1_prct"])
    .agg({"reviews_2018_2020": "sum", "reviews_2020_2022": "sum"})
    .reset_index()
)

review_changes["percent_change"] = (
    review_changes["reviews_2020_2022"] - review_changes["reviews_2018_2020"]
) / review_changes["reviews_2018_2020"]
review_changes["total_reviews_sum"] = (
    review_changes["reviews_2018_2020"] + review_changes["reviews_2020_2022"]
)
sorted_review_changes = review_changes.sort_values(
    by="total_reviews_sum", ascending=False
)

fig = px.bar(
    sorted_review_changes,
    x="top_level_category",
    y="percent_change",
    color="top_1_prct",  # Color based on the 'top_1_prct' group
    title="Percentage Change in Number of Podcast Reviews for Top Categories",
    labels={"percent_change": "Percentage Change", "top_level_category": "Category"},
    barmode="group",
    hover_data=["reviews_2018_2020", "reviews_2020_2022"],
)

fig.update_traces(
    hovertemplate="<br>".join(
        [
            "Category: %{x}",
            "Reviews 2018-2020: %{customdata[0]:,}",
            "Reviews 2020-2022: %{customdata[1]:,}",
            "Percent Change: %{y:.2%}",
        ]
    )
)


fig.update_layout(
    yaxis_tickformat=".0%",
    yaxis_title="Percent Change in Review Count",
    xaxis_tickangle=-45,
)
fig.update_layout(legend_title_text="Top 1% Podcasts")
fig.update_layout(
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)

graph_utils.render_fig(fig)

The red bars show the increase in popularity of the top 1% of podcasts (relative to category). We can see there is a lot of variance between categories. In some case the popularity of top 1% podcasts has decreased (i.e. the distribution of listeners between different podcasts increased) which goes against the overal trend we have observed previously.

Summary¶

Our analysis indicates that Spotify's strategy to invest heavily in a select portfolio of podcasts resulted in a skewed growth pattern: the top 1% of podcasts, likely including expensive exclusives, saw review counts—and by proxy, popularity—grow faster than the remaining 99%.

However, post-2021 data (including external sources) reveals a downturn, questioning the longevity of listener interest. While these expensive/popular shows initially captured disproportionate listener attention, whether this translated into a broader, sustained engagement across Spotify's podcast spectrum is unclear. The challenge ahead for Spotify is to leverage early gains from high-profile investments to cultivate a diverse, enduring podcast ecosystem.

Limitations¶

The core of this analysis relies on the assumption that there is a very strong correlation between the number of reviews a podcast and the number of listeners.